Principal component analysis (PCA) is arguably the first method for dimension reduction, dating back to papers by some of the earliest contributors to statistical theory, including Karl Pearson and Harold Hotelling [@pca_pearson; @pca_hotelling]. Pearson’s original development of PCA borrowed ideas from mechanics, which lends a clear geometric/physical interpretation to the resulting PCA loadings, variances, and scores, all of which we will define later. This interpretability, together with an implementation that uses scalable linear algebra methods – allowing PCA to be conducted on massive datasets – is a major reason PCA is still used prolifically to this day. In fact, many more modern and complex methods still rely on PCA as an internal step in their algorithmic structure.
There are a number of different but equivalent derivations of PCA, including the minimization of least squares error, covariance decompositions, and low-rank approximations. We’ll revisit these ideas later, but first, let’s develop PCA through a geometrically motivated lens via a method called iterative projections.
We begin with a data matrix \[{\bf X} = \begin{bmatrix} \vec{x}_1^T\\ \vdots \\\vec{x}_N^T\end{bmatrix} \in\mathbb{R}^{N\times d}.\] As a first example of dimension reduction, we’ll seek to replace the vectors \(\vec{x}_1,\dots,\vec{x}_N\) with corresponding scalars \(y_1,\dots,y_N\) that preserve as much of the variability between these vectors as possible. To formalize this idea, let’s introduce a few assumptions.
First, we’ll assume the data \(\vec{x}_1,\dots,\vec{x}_N\) are centered. This is not a requirement, but it will simplify the analysis later. We’ll discuss how to account for this centering step later, but for now assume \(\bar{\vec{x}} = \vec{0}\) so that \({\bf HX} = {\bf X}\). More importantly, let’s assume that each \(y_i\) is derived in the same way. Specifically, let \(y_i = \vec{x}_i^T \vec{w}\) for some common vector \(\vec{w}\). Thus, we can view each one-dimensional representation as a dot product of the corresponding observed vector with the same vector \(\vec{w}.\) We can compactly write this expression as \[\vec{y} = \begin{bmatrix}y_1\\ \vdots \\ y_N \end{bmatrix}=\begin{bmatrix}\vec{x}_1^T \vec{w} \\ \vdots \\ \vec{x}_N^T \vec{w}\end{bmatrix} = {\bf X} \vec{w}.\]
How do we choose \(\vec{w}\)? We would like differences in the scalars \(y_1,\dots,y_N\) to reflect differences in the vectors \(\vec{x}_1,\dots,\vec{x}_N\), so having \(y_1,\dots,y_N\) spread out is a natural goal: if \(\vec{x}_i\) and \(\vec{x}_j\) are far apart, then \(y_i\) and \(y_j\) should be as well. To accomplish this, we’ll try to maximize the sample variance of the \(y\)’s. The sample variance \[\frac{1}{N} \sum_{i=1}^N (y_i - \bar{y})^2 = \frac{1}{N}\sum_{i=1}^N(\vec{x}_i^T \vec{w} - \bar{y})^2\] will depend on our choice of \(\vec{w}\). In the previous expression, \[\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i = \frac{1}{N}\sum_{i=1}^N \vec{x}_i^T \vec{w} = \frac{1}{N}\vec{1}^T{\bf X}\vec{w}\] is the sample mean of \(y_1,\dots,y_N.\) Importantly, since we have assumed that \(\vec{x}_1,\dots,\vec{x}_N\) are centered, it follows that \(\bar{y}=0\) and the sample variance of \(y_1,\dots,y_N\) simplifies to \[\frac{1}{N}\sum_{i=1}^N(\vec{x}_i^T \vec{w})^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 = \frac{1}{N} \|\vec{y}\|^2 = \frac{1}{N}\vec{y}^T\vec{y}.\]
We can write the above expression more compactly. Using the identity \(\vec{y} = {\bf X}\vec{w}\), we want to choose \(\vec{w}\) to maximize \[\frac{1}{N}\vec{y}^T\vec{y} = \frac{1}{N}({\bf X}\vec{w})^T{\bf X}\vec{w} = \frac{1}{N}\vec{w}^T{\bf X}^T{\bf X}\vec{w} = \vec{w}^T\left(\frac{{\bf X}^T{\bf X}}{N}\right)\vec{w}.\] Since we have assumed that \({\bf X}\) is centered it follows that \({\bf X}^T{\bf X}/N\) is the sample covariance matrix \(\hat{\bf \Sigma}\)! Thus, we want to make \(\vec{w}^T\hat{\bf \Sigma} \vec{w}\) as large as possible.
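As a quick numeric check of this identity, the snippet below compares the two sides on simulated data (the seed, dimensions, and choice of \(\vec{w}\) here are arbitrary and for illustration only):

```r
set.seed(1)                                    # illustrative simulated data
N <- 200; d <- 3
X <- matrix(rnorm(N * d), N, d)
X <- scale(X, center = TRUE, scale = FALSE)    # center the columns

w <- c(1, 2, -1); w <- w / sqrt(sum(w^2))      # an arbitrary unit vector
y <- X %*% w                                   # one-dimensional representations

Sigmahat <- t(X) %*% X / N                     # sample covariance matrix
var_direct <- mean(y^2)                        # (1/N) sum of y_i^2 (since ybar = 0)
var_quadratic <- as.numeric(t(w) %*% Sigmahat %*% w)
all.equal(var_direct, var_quadratic)           # TRUE
```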
Naturally, we could increase the entries in \(\vec{w}\) and increase the above expression without bound. To make the maximization problem well posed, we will restrict \(\vec{w}\) to be unit-length under the Euclidean norm so that \(\|\vec{w}\|=1.\) We now have a constrained optimization problem which gives rise to the first principal component loading.
The first principal component loading is the vector \(\vec{w}_1\) solving the constrained optimization problem \[\begin{equation} \begin{split} \text{Maximize } &\vec{w}^T \hat{\bf \Sigma}\vec{w} \\ \text{subject to constraint } &\|\vec{w}\|=1. \end{split} \end{equation}\] The first principal component scores are the projections, \(y_i = \vec{x}_i^T\vec{w}_1\) for \(i=1,\dots, N\), of each sample onto the first loading.
To find the first PCA loading we can make use of Lagrange multipliers (see exercises) to show that \(\vec{w}_1\) must also satisfy the equation \[\hat{\bf \Sigma}\vec{w}_1 = \lambda \vec{w}_1\] where \(\lambda\) is the Lagrange multiplier. From this expression, we can conclude that the first principal component loading is the unit length eigenvector associated with the largest eigenvalue of the sample covariance matrix \(\hat{\bf \Sigma}\) and that the Lagrange multiplier \(\lambda\) is the largest eigenvalue of \(\hat{\bf \Sigma}\). In this case, we refer to \(\lambda\) as the first principal component variance.
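To see this characterization in action, the sketch below (simulated data; all names are illustrative) compares the value of \(\vec{w}^T\hat{\bf \Sigma}\vec{w}\) at the leading unit eigenvector against many random unit directions:

```r
set.seed(2)                                    # illustrative simulated data
N <- 200; d <- 4
X <- scale(matrix(rnorm(N * d), N, d), center = TRUE, scale = FALSE)
Sigmahat <- t(X) %*% X / N

w1 <- eigen(Sigmahat)$vectors[, 1]             # unit eigenvector, largest eigenvalue
obj <- function(w) as.numeric(t(w) %*% Sigmahat %*% w)

# compare against many random unit directions
random_vals <- replicate(1000, {
  w <- rnorm(d)
  obj(w / sqrt(sum(w^2)))
})
obj(w1) >= max(random_vals)                    # TRUE: no random direction does better
```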
Since \(\|\vec{w}_1\| = 1\) we may interpret this vector as specifying a direction in \(\mathbb{R}^d\). Additionally, we can decompose each of our samples into two pieces: one pointing in the direction specified by \(\vec{w}_1\) and a second portion perpendicular to this direction. Thus, we may write \[\vec{x}_i = \underbrace{\vec{w}_1 \vec{x}_i^T\vec{w}_1}_{\text{parallel}} + \underbrace{(\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1)}_{\text{perpendicular}}.\] By the Pythagorean theorem, \[\begin{align*} \|\vec{x}_i\|^2 &= \| \vec{w}_1 \vec{x}_i^T\vec{w}_1 \|^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \\ &= (\vec{w}_1^T\vec{x}_i)^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \\ &= y_i^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \end{align*}\] for \(i=1,\dots,N\). Averaging over all of the samples gives the expression \[\frac{1}{N}\sum_{i=1}^N\|\vec{x}_i\|^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 +\frac{1}{N}\sum_{i=1}^N \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2.\] The left-hand side of the above expression is fixed for a given set of data, whereas the first term on the right side is exactly what we sought to maximize when finding the first principal component loading. This quantity is the average squared length of the projection of each sample onto the direction \(\vec{w}_1\). As such, we can view the first principal component loading as the direction in which \(\vec{x}_1,\dots,\vec{x}_N\) vary the most. Let’s turn to an example in \(\mathbb{R}^3\) to visualize this.
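The parallel/perpendicular decomposition above is easy to verify numerically. The following sketch, on simulated data, splits one sample into its two pieces and checks the Pythagorean identity:

```r
set.seed(3)                                    # illustrative simulated data
X <- scale(matrix(rnorm(50 * 3), 50, 3), center = TRUE, scale = FALSE)
w1 <- eigen(t(X) %*% X / nrow(X))$vectors[, 1] # first PC loading

x <- X[1, ]                                    # a single sample
y1 <- sum(x * w1)                              # its first PC score, x^T w1
parallel <- w1 * y1                            # component along w1
perp <- x - parallel                           # component orthogonal to w1

abs(sum(perp * w1)) < 1e-10                    # TRUE: perp really is perpendicular
all.equal(sum(x^2), y1^2 + sum(perp^2))        # TRUE: Pythagorean identity
```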
Below, we show a scatterplot of \(N=500\) points in \(\mathbb{R}^3\) drawn randomly from a multivariate normal distribution. These data have been centered.
Notice the oblong shape of the cloud of points. Rotating this image, it is clear that the data vary more in certain directions than in others.
The sample covariance matrix of these data is \[ \hat{\Sigma} = \begin{bmatrix} 11.83&6.78&3.38 \\ 6.78&8.88&9.2 \\ 3.38&9.2&15.9 \\ \end{bmatrix} \] We can use the eigendecomposition of this matrix to find the first PCA loading and variance. Given the first loading we can also compute the first PCA scores. A short snippet of code for this calculation is shown below.
Sigmahat <- (t(data) %*% data)/N # calculate the sample covariance matrix, recall the data has been centered
Sigmahat.eigen <- eigen(Sigmahat) # calculate the eigen decomposition of Sigmahat
y <- data %*% Sigmahat.eigen$vectors[,1] # calculate the scores
The largest eigenvalue (first PC variance) is \(\lambda = 25.58\) with associated eigenvector \((0.45, 0.56, 0.69)^T\) which is the first PC loading.
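We can sanity-check these reported values by applying `eigen()` to the covariance matrix as printed above (its entries are rounded, so the results agree only to the displayed precision):

```r
# the sample covariance matrix as printed in the text (rounded entries)
Sigmahat <- matrix(c(11.83, 6.78,  3.38,
                     6.78,  8.88,  9.20,
                     3.38,  9.20, 15.90), 3, 3, byrow = TRUE)
eig <- eigen(Sigmahat)
eig$values[1]     # leading eigenvalue, approximately 25.58
eig$vectors[, 1]  # approximately (0.45, 0.56, 0.69), up to sign
```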
We can visualize these results in a few different ways. First, we can add the span of \(\vec{w}_1\) (shown in red) to the scatterplot of the data. One can see that \(\vec{w}_1\) is oriented along the direction where the data is most spread out.

Samples with Span of First Loading
We could also generate a scatterplot of the scores, but we’ll show these scores along the \(\vec{w}_1\) axis so that they correspond to the projection of each sample onto span\(\{\vec{w}_1\}\).
Decomposition of Samples into Components Parallel and Perpendicular to First Loading
The first PCA loading provides information about the direction in which our data vary the most, but it is quite possible that there are other directions in which our data still exhibit a lot of variability. In fact, the notion of a first principal component loading, scores, and variance suggests the existence of a second, third, etc. collection of these quantities. To explore these quantities, let’s proceed as follows.
For each datum, we can remove its component in the direction of \(\vec{w}_1\) and focus on the projection onto the orthogonal complement of \(\vec{w}_1\). Let \[\vec{x}_i^{(1)} = \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 = \vec{x}_i - \vec{w}_1 y_i\] denote the portion of \(\vec{x}_i\) which is orthogonal to \(\vec{w}_1\). Here, the superscript \(^{(1)}\) indicates we have removed the portion of the vector in the direction of the first loading.
We can organize the orthogonal components into a new data matrix \[{\bf X}^{(1)} = \begin{bmatrix} \left(\vec{x}_1^{(1)}\right)^T \\ \vdots \\ \left(\vec{x}_N^{(1)}\right)^T \end{bmatrix} = \begin{bmatrix} \vec{x}_1^T - \vec{x}_1^T\vec{w}_1\vec{w}_1^T \\ \vdots \\ \vec{x}_N^T - \vec{x}_N^T\vec{w}_1\vec{w}_1^T \end{bmatrix} = {\bf X} - {\bf X}\vec{w}_1\vec{w}_1^T.\] Now let’s apply PCA to the updated data matrix \({\bf X}^{(1)}\), from which we get the second principal component loading, denoted \(\vec{w}_2\), the second principal component scores, and the second principal component variance. One can show that the data matrix \({\bf X}^{(1)}\) is centered, so its sample covariance matrix is \(\hat{\bf \Sigma}^{(1)} = \frac{1}{N}({\bf X}^{(1)})^T{\bf X}^{(1)}.\) Thus, the second PCA loading, \(\vec{w}_2\), is a unit eigenvector associated with the largest eigenvalue of \(\hat{\bf \Sigma}^{(1)}\). This eigenvalue is the second PCA variance, and the second PCA score of \(\vec{x}_i\) is given by the inner product of \(\vec{x}_i^{(1)}\) with \(\vec{w}_2\).
Here is one crucial observation. The vector \(\vec{w}_2\) gives the direction of greatest variability of the vectors \(\vec{x}_1^{(1)},\dots,\vec{x}_N^{(1)}.\) For each of these vectors we have removed the component in the direction of \(\vec{w}_1\). Thus, \(\vec{x}_1^{(1)},\dots,\vec{x}_N^{(1)}\) do not vary at all in the \(\vec{w}_1\) direction. What can we say about \(\vec{w}_2\)? Naturally, it must be perpendicular to \(\vec{w}_1\)!
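This perpendicularity is easy to confirm numerically. Below is a sketch of one deflation step on simulated data (the seed, dimensions, and mixing matrix are illustrative choices):

```r
set.seed(4)                                      # illustrative simulated data
N <- 300; d <- 3
A <- matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 1), 3, 3)  # mixing matrix for anisotropy
X <- scale(matrix(rnorm(N * d), N, d) %*% A, center = TRUE, scale = FALSE)

w1 <- eigen(t(X) %*% X / N)$vectors[, 1]         # first loading
X1 <- X - X %*% w1 %*% t(w1)                     # deflate: X^(1) = X - X w1 w1^T
w2 <- eigen(t(X1) %*% X1 / N)$vectors[, 1]       # second loading

abs(sum(w1 * w2)) < 1e-8                         # TRUE: w2 is perpendicular to w1
```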
We need not stop at the second PCA loading, scores, and variance. We could remove components in the direction of \(\vec{w}_2\) and apply PCA to the vectors \[\begin{align*} \vec{x}_i^{(2)} &= \vec{x}_i^{(1)} - \vec{w}_2 (\vec{x}_i^{(1)})^T\vec{w}_2\\ &= \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 - \vec{w}_2(\vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1)^T\vec{w}_2\\ &= \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 - \vec{w}_2\vec{x}_i^T\vec{w}_2 + \vec{w}_2\vec{w}_1^T\vec{x}_i\underbrace{\vec{w}_1^T\vec{w}_2}_{=0} \end{align*}\] to obtain a third loading, variance, and set of scores. We can continue repeating this argument \(d\) times for our \(d\)-dimensional data until we arrive at a set of \(d\) unit vectors \(\vec{w}_1,\dots,\vec{w}_d\): the \(d\) PCA loadings.
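The whole iterative-projection scheme fits in a short loop. On simulated data (all names illustrative), the loadings it produces agree, up to sign, with the eigenvectors of the original sample covariance matrix:

```r
set.seed(5)                                    # illustrative simulated data
N <- 300; d <- 4
X <- scale(matrix(rnorm(N * d), N, d), center = TRUE, scale = FALSE)

Xk <- X
W <- matrix(0, d, d)                           # columns will hold the loadings
for (k in 1:d) {
  W[, k] <- eigen(t(Xk) %*% Xk / N)$vectors[, 1]  # leading direction of deflated data
  Xk <- Xk - Xk %*% W[, k] %*% t(W[, k])          # remove that direction
}

# up to sign, the loadings match the eigenvectors of the original covariance
V <- eigen(t(X) %*% X / N)$vectors
max(abs(abs(t(W) %*% V) - diag(d)))            # near 0
```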
To review, we then have \(d\) PCA loadings \(\vec{w}_1,\dots,\vec{w}_d\), each with an associated PCA variance \(\lambda_1,\dots,\lambda_d\). For each sample, we also have an associated set of PCA scores \(\vec{x}_i^T\vec{w}_1,\dots,\vec{x}_i^T\vec{w}_d\) which we can organize into a large matrix \[{\bf Y} = \begin{bmatrix} \vec{x}_1^T\vec{w}_1 & \dots & \vec{x}_1^T\vec{w}_d \\ \vdots & \ddots & \vdots \\ \vec{x}_N^T\vec{w}_1 & \dots & \vec{x}_N^T\vec{w}_d \end{bmatrix} = {\bf X}\begin{bmatrix} \vec{w}_1 \,| \dots \,|\,\vec{w}_d\end{bmatrix}.\]
We formalize this idea in the following Lemma.
Suppose vectors \(\vec{x}_1,\dots,\vec{x}_N\in\mathbb{R}^d\) are centered and have sample covariance matrix \(\hat{\bf \Sigma}\). Let \(\lambda_1 \ge \dots\ge\lambda_d \ge 0\) denote the eigenvalues of \(\hat{\bf \Sigma}\) with associated eigenvectors \(\vec{w}_1,\dots,\vec{w}_d.\) Now let \(\vec{x}^{(k)}_i = \vec{x}_i - \sum_{j=1}^k \vec{w}_j\vec{x}_i^T\vec{w}_j\) denote the portion of vector \(\vec{x}_i\) which is orthogonal to each of the first \(k\) PCA loadings. If \(\hat{\bf \Sigma}^{(k)}\) denotes the sample covariance matrix of these vectors, it follows that \(\hat{\bf \Sigma}^{(k)}\) has eigenvalues \(\underbrace{0,\dots,0}_{k}\), \(\lambda_{k+1}\ge \dots \ge \lambda_d\ge 0\) with associated eigenvectors \[\underbrace{\vec{w}_1,\dots,\vec{w}_k,}_{\text{each with eigenvalue 0}} \vec{w}_{k+1},\dots,\vec{w}_d.\]
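The Lemma's claim about the deflated spectrum can be checked numerically (a sanity check, not a proof); here for \(k=1\) on simulated data:

```r
set.seed(6)                                    # illustrative simulated data
N <- 500; d <- 3
X <- scale(matrix(rnorm(N * d), N, d), center = TRUE, scale = FALSE)
eig <- eigen(t(X) %*% X / N)
w1 <- eig$vectors[, 1]

X1 <- X - X %*% w1 %*% t(w1)                   # remove the first loading's direction
lam1 <- eigen(t(X1) %*% X1 / N)$values         # spectrum of Sigmahat^(1)

# the deflated spectrum is (0, lambda_2, lambda_3), as the Lemma claims
all.equal(sort(lam1), sort(c(0, eig$values[-1])), tolerance = 1e-6)  # TRUE
```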
Verifying this Lemma is left as an exercise.
For each loading, we have a corresponding set of PCA scores and PCA variances. Importantly, since our data are \(d\)-dimensional and our loadings are \(d\) mutually orthogonal unit vectors, the loadings define a new basis for \(\mathbb{R}^d\).
Let’s continue our example above to see this process in the case where \(d=3\).
We can continue this process by also removing, from each vector, its component in the direction of \(\vec{w}_2.\)
Finally, we can plot the data in the basis defined by the loadings. This is equivalent to a scatterplot of the scores.
library(plotly) # interactive 3D scatterplots
plot_ly(data.frame(x = y[, 1], y = y[, 2], z = y[, 3]),
        x = ~x, y = ~y, z = ~z,
        marker = list(size = 5)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = "y<sub>1</sub>"),
                      yaxis = list(title = "y<sub>2</sub>"),
                      zaxis = list(title = "y<sub>3</sub>")),
         title = "Principal Component Scores")
In this basis, the data form an ellipsoid with principal axes directed along the \(y_1,\,y_2, \text{ and }y_3\) axes. This is a result of the uncorrelated nature of the principal component scores.
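This uncorrelatedness can be verified directly: in the basis of the loadings, the sample covariance matrix of the scores is diagonal, with the PC variances on its diagonal. A sketch on simulated data:

```r
set.seed(7)                                    # illustrative simulated data
N <- 500; d <- 3
X <- scale(matrix(rnorm(N * d), N, d), center = TRUE, scale = FALSE)
eig <- eigen(t(X) %*% X / N)

Y <- X %*% eig$vectors                         # all d columns of PC scores
S_Y <- t(Y) %*% Y / N                          # sample covariance of the scores

# S_Y is diagonal: the off-diagonal entries (score covariances) vanish
max(abs(S_Y - diag(eig$values)))               # near 0
```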